DDHQ Election Fellow Data Report

Author

Amelia Minkin

Overview

This report outlines the methodology used to predict the margin of victory between the Republican and Democratic candidates in every 2018 U.S. House race.

Packages used

Code
#General use
library(tidyverse)

#Visualizations
library(plotly)
library(usmap)
library(ggplot2)
library(kableExtra)

#Model building
library(tidymodels)
library(rpart)
library(yardstick)
library(baguette)

Loading and Splitting the Data

Once I loaded the data into R, I split it into training (75%) and testing (25%) sets. I chose this proportion because the data set is reasonably large.

Additionally, I performed some initial pre-processing steps; further pre-processing is covered in its own section. Initially, I did the following:

  • Transformed the victory margin to numeric form

  • Excluded the 2018 observations from the training data: since I will be predicting the 2018 victory margins, there are no observed margins for that year. I did not exclude these observations from the testing data, since the model is evaluated on the testing data.

  • Divided the victory margin by 100, since it is reported in percentage form while all of the predictor variables are in decimal form

LOADING IN DATA

Code
options(scipen = 100, digits = 4)
election_results<-read.csv("DDHQ_Data_Exercise-1.csv")

#Transforming the victory margin to numeric, decimal form
election_results$R.D.Victory.Margin<-as.numeric(election_results$R.D.Victory.Margin)
election_results$R.D.Victory.Margin<-election_results$R.D.Victory.Margin/100

SPLITTING DATA

Code
#Splitting data
set.seed(20201020)
result_split<-initial_split(data=election_results,prop=0.75)

#Training data
result_train<-training(x=result_split)
result_train<-subset(result_train,Year!="2018")

#Testing data
result_test<-testing(x=result_split)

Exploratory Data Analysis/Visualizations

To gain a more robust understanding of the data, in addition to exploring it manually through some auxiliary regressions, I examined the variance of the victory margin to help inform the type of model I would build. The boxplot below indicates high variance in the training data.

VARIANCE OF VICTORY MARGIN OVER TIME

Code
boxplot(R.D.Victory.Margin*100~Year,
        data=result_train)

Additionally, I created a visualization of the average vote margin between Democratic and Republican candidates in each state. I aggregated the House districts and found the average margin in each state over the 2006-2016 time frame.

The visualization appears below in both a static and an interactive form:

FINDING AVERAGE HOUSE VICTORY MARGIN PER STATE OVER TIME (2006-2016)

Code
averages<-result_train%>%
  group_by(state=State)%>%
  summarize(Average_Margin=mean(R.D.Victory.Margin*100))

STATIC VISUALIZATION

Code
margin_plot<-
  plot_usmap(regions="state",
             data=averages,
             values="Average_Margin",
             color="black",
             size=0.001)+
  scale_fill_gradientn(colors=c("blue","blue","white", "red"))+
  labs(title="Average House District Victory Margin By State from 2006-2016")+
  theme(legend.position = "right",
        panel.background=element_rect(colour = "black", fill = "white"), 
        plot.title = element_text(face="bold"))

Code
margin_plot

INTERACTIVE VISUALIZATION

Code
interactive_plot<-ggplotly(margin_plot)
Code
interactive_plot

Data Pre-Processing

To ensure the data could be used in a predictive model, I performed further pre-processing. In addition to the initial steps noted earlier when loading and splitting the data (transforming the victory margin to numeric, decimal form and filtering the 2018 observations from the training data), I created a recipe with pre-processing steps to do the following:

  • Remove all string predictors: this removed the variables “Race.ID,” “Chamber,” “State,” “Incumbent.Running.,” and “Geography” from the set, since these non-numeric variables are not relevant for obtaining predicted values.

  • Remove the “Congressional District” and “Year” variables: while these are numeric, they are not relevant predictors of victory margin.

To apply these steps to the data, I prepped the recipe and then applied it to the training data by baking it.

DATA PRE-PROCESSING

Code
#Creating recipe for pre-processing
margin_recipe<-recipe(R.D.Victory.Margin~.,data=result_train)%>%
  step_rm(all_string_predictors())%>%
  step_rm("Congressional.District")%>%
  step_rm("Year")%>%
  prep()

#Baking the recipe
bake(margin_recipe,new_data=result_train)

Building and implementing the model

I decided to utilize a decision tree model using bootstrap aggregation.

  • In performing exploratory data analysis, I noticed that the victory margin in the training data has consistently high variance across the years. This makes sense given that the data cover all of the Congressional House districts in the country, which have vastly different racial, political, and geographic demographics. Additionally, there are several occurrences of seats going unopposed. To account for the data’s high variance and the unopposed-race outliers, and to reduce overfitting, I used decision trees with bootstrap aggregation for resampling.
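For reference, the number of bootstrap resamples that bag_tree() fits and averages can be set explicitly through the rpart engine. The sketch below is illustrative only; the times value of 25 is an assumption, not something tuned for this report:

```r
library(baguette)

#Illustrative only: `times` sets how many bootstrap samples of the
#training data are fit; their predictions are averaged at predict time
bag_mod_25<-
  bag_tree()%>%
  set_engine("rpart",times=25)%>%
  set_mode("regression")
```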

CREATING MODEL

Code
#Model
bag_mod<-
  bag_tree()%>%
  set_engine("rpart")%>%
  set_mode("regression")

#Workflow
bag_wf<-
  workflow()%>%
  add_model(bag_mod)%>%
  add_recipe(margin_recipe)

ESTIMATING MODEL

Code
#Fitting model on training data
bag_fit<-bag_wf%>%
  fit(result_train)

EVALUATING MODEL

Code
#Evaluating model on testing data
margin_predicted<-bind_cols(result_test,
                            predict(object=bag_fit,
                                    new_data=result_test))

summary(result_test$R.D.Victory.Margin)

IMPLEMENTING MODEL

Code
#Implementing model for prediction
implementation_predictions<-bind_cols(election_results,
                                      predict(object=bag_fit,
                                              new_data=election_results))

#Putting victory margin back into percent 
implementation_predictions$R.D.Victory.Margin<-implementation_predictions$R.D.Victory.Margin*100

implementation_predictions$.pred<-implementation_predictions$.pred*100

#Filtering implemented data to see results
implementation_predictions<-implementation_predictions%>%select(c("Race.ID", "Year","R.D.Victory.Margin",".pred"))

Model Evaluation and Interpretation

Through evaluating the model’s metrics, I found the RMSE to be 9.80, meaning the predicted values deviated from the true victory margins by 9.80 percentage points on a root-mean-square basis for the races from 2006-2016 (the MAE of 5.22 gives the average absolute deviation). Additionally, I analyzed the predictions for 2018, and the model predicted a 243D/192R split of the House.
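As a reminder of how these two metrics are defined, RMSE and MAE can be computed by hand; a minimal sketch on toy numbers (the values below are illustrative, not from the data):

```r
#Toy actual and predicted margins (in percent); illustrative values only
actual<-c(10,-25,40,5)
predicted<-c(12,-20,35,0)
errors<-predicted-actual

#RMSE squares the errors first, so it penalizes large misses more than MAE
rmse<-sqrt(mean(errors^2))
mae<-mean(abs(errors))

rmse  #sqrt((4+25+25+25)/4) = sqrt(19.75), approximately 4.44
mae   #(2+5+5+5)/4 = 4.25
```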

Following these evaluations, I compared against the actual 2018 results and found that the model performed okay, but not great: it was not completely accurate in predicting the margins of victory or the overall results.

This was expected for a multitude of reasons. Predicting the voting pattern of individuals within a House district is incredibly challenging. The voting behavior of individuals depends heavily on factors outside of what was included in the model, such as who is running and how well the candidate’s values align with voters’. Another major consideration that can be challenging to capture is the impact of redistricting. Redistricting conducted after the 2010 census changed the racial and political makeup of House districts across the country, which in turn affected the voting behavior of each district as a whole and, consequently, the margin of victory between candidates.

Some ways to account for these important considerations, and thereby improve the model’s fit and its ability to accurately predict victory margins, could be (1) feature engineering to address the challenge of capturing voting patterns and (2) adding more predictor variables to make the model more robust (I will note, however, that it can be challenging to gather House-district-specific demographic data).
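As one hypothetical illustration of point (1), a district’s previous-cycle margin could be engineered as a predictor. The column names below are made up to mirror this report’s data, and the lag assumes one observation per district per election cycle:

```r
library(dplyr)

#Hypothetical example: engineer each district's previous-cycle margin
#as a new predictor (one row per district per election year assumed)
example<-tibble::tibble(
  District=c("AL01","AL01","AL01"),
  Year=c(2012,2014,2016),
  Margin=c(0.30,0.25,0.28))

engineered<-example%>%
  group_by(District)%>%
  arrange(Year,.by_group=TRUE)%>%
  mutate(Prev_Margin=lag(Margin))%>%
  ungroup()
```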

METRICS

Code
#Metrics
implementation_predictions %>%
  filter(Year!="2018")%>%
  metrics(R.D.Victory.Margin, .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       9.80 
2 rsq     standard       0.957
3 mae     standard       5.22 

PREDICTIONS (SEAT COUNT/MARGINS)

Code
#Filtering to see 2018 results
pred_2018<-filter(implementation_predictions,Year=="2018")%>%select(c("Race.ID", ".pred"))

#Seeing breakdown of results
pred_2018%>%count(grepl("-",.pred))
  grepl("-", .pred)   n
1             FALSE 243
2              TRUE 192
Code
#Seeing 2018 Predictions
pred_2018%>%
  kbl(caption="2018 Predicted Victory Margins")%>%
  kable_minimal("hover",full_width=F)%>%
  scroll_box(width = "300px", height = "200px")
2018 Predicted Victory Margins
Race.ID .pred
H2018AK00 20.9236
H2018AL01 30.2545
H2018AL02 15.2091
H2018AL03 30.0773
H2018AL04 55.6655
H2018AL05 40.6964
H2018AL06 52.9900
H2018AL07 -100.0000
H2018AR01 51.6264
H2018AR02 18.0336
H2018AR03 43.9236
H2018AR04 34.9864
H2018AZ01 -11.0845
H2018AZ02 12.5736
H2018AZ03 -40.2809
H2018AZ04 55.6764
H2018AZ05 27.5736
H2018AZ06 33.0291
H2018AZ07 -100.0000
H2018AZ08 28.9064
H2018AZ09 -17.7164
H2018CA01 19.7227
H2018CA02 -51.3282
H2018CA03 -22.2755
H2018CA04 22.7164
H2018CA05 -100.0000
H2018CA06 -100.0000
H2018CA07 -9.6136
H2018CA08 100.0000
H2018CA09 -19.4545
H2018CA10 12.2655
H2018CA11 -39.8064
H2018CA12 -66.0300
H2018CA13 -100.0000
H2018CA14 -57.7873
H2018CA15 -44.6809
H2018CA16 -13.8173
H2018CA17 -49.1445
H2018CA18 -45.7264
H2018CA19 -51.3273
H2018CA20 -100.0000
H2018CA21 4.6764
H2018CA22 28.0636
H2018CA23 27.6809
H2018CA24 -10.3127
H2018CA25 13.1173
H2018CA26 -26.3700
H2018CA27 -100.0000
H2018CA28 -52.7345
H2018CA29 -63.5136
H2018CA30 -41.1582
H2018CA31 -16.9400
H2018CA32 -45.3464
H2018CA33 -38.4564
H2018CA34 -100.0000
H2018CA35 -41.7973
H2018CA36 -24.9145
H2018CA37 -74.7627
H2018CA38 -45.6936
H2018CA39 8.8627
H2018CA40 -100.0000
H2018CA41 -34.2800
H2018CA42 31.2636
H2018CA43 -63.4973
H2018CA44 -100.0000
H2018CA45 7.2000
H2018CA46 -43.8809
H2018CA47 -28.2564
H2018CA48 17.5300
H2018CA49 7.3000
H2018CA50 29.1382
H2018CA51 -42.3064
H2018CA52 -19.9264
H2018CA53 -36.1309
H2018CO01 -39.8718
H2018CO02 -30.0627
H2018CO03 6.8427
H2018CO04 32.4282
H2018CO05 26.1336
H2018CO06 5.4027
H2018CO07 -15.2936
H2018CT01 -34.0009
H2018CT02 -17.4545
H2018CT03 -33.2564
H2018CT04 -28.3309
H2018CT05 -7.8427
H2018DE00 -14.9327
H2018FL01 44.5882
H2018FL02 36.0400
H2018FL03 16.4782
H2018FL04 36.4327
H2018FL05 -24.2255
H2018FL06 23.3018
H2018FL07 3.7327
H2018FL08 22.1118
H2018FL09 -10.0273
H2018FL10 -100.0000
H2018FL11 26.4118
H2018FL12 27.0764
H2018FL13 -4.8955
H2018FL14 -100.0000
H2018FL15 0.2000
H2018FL16 26.5473
H2018FL17 27.9300
H2018FL18 9.7009
H2018FL19 28.8418
H2018FL20 -100.0000
H2018FL21 -100.0000
H2018FL22 -16.2509
H2018FL23 -22.2845
H2018FL24 -100.0000
H2018FL25 5.3964
H2018FL26 6.5473
H2018FL27 5.9973
H2018GA01 28.8909
H2018GA02 -18.9164
H2018GA03 39.9145
H2018GA04 -45.6682
H2018GA05 -100.0000
H2018GA06 19.4455
H2018GA07 25.3036
H2018GA08 100.0000
H2018GA09 57.8564
H2018GA10 32.7827
H2018GA11 23.4236
H2018GA12 26.5955
H2018GA13 -46.7964
H2018GA14 59.5400
H2018HI01 -38.0491
H2018HI02 -42.8691
H2018IA01 20.4527
H2018IA02 -12.9264
H2018IA03 6.8418
H2018IA04 27.6300
H2018ID01 40.2145
H2018ID02 32.9309
H2018IL01 -47.2936
H2018IL02 -63.3909
H2018IL03 -36.2027
H2018IL04 -64.9518
H2018IL05 -47.7836
H2018IL06 20.0664
H2018IL07 -74.3591
H2018IL08 -19.9700
H2018IL09 -35.5536
H2018IL10 -11.1827
H2018IL11 -23.4500
H2018IL12 19.3109
H2018IL13 1.2745
H2018IL14 15.2800
H2018IL15 52.9745
H2018IL16 24.6636
H2018IL17 -26.3073
H2018IL18 42.9064
H2018IN01 -42.9909
H2018IN02 26.0009
H2018IN03 33.0400
H2018IN04 30.3045
H2018IN05 31.6436
H2018IN06 44.7227
H2018IN07 -20.2800
H2018IN08 33.1055
H2018IN09 19.8464
H2018KS01 41.4791
H2018KS02 27.8600
H2018KS03 10.8309
H2018KS04 22.4845
H2018KY01 52.6691
H2018KY02 48.3364
H2018KY03 -19.6709
H2018KY04 40.9318
H2018KY05 58.0264
H2018KY06 23.2609
H2018LA01 56.8245
H2018LA02 -100.0000
H2018LA03 50.2909
H2018LA04 24.7491
H2018LA05 26.5182
H2018LA06 34.8555
H2018MA01 -100.0000
H2018MA02 -34.9745
H2018MA03 -30.7973
H2018MA04 -100.0000
H2018MA05 -40.7000
H2018MA06 -34.4118
H2018MA07 -100.0000
H2018MA08 -100.0000
H2018MA09 -20.2236
H2018MD01 30.0682
H2018MD02 -31.0936
H2018MD03 -30.8964
H2018MD04 -52.3327
H2018MD05 -34.9764
H2018MD06 -26.0936
H2018MD07 -52.1318
H2018MD08 -32.1473
H2018ME01 -31.7073
H2018ME02 7.2682
H2018MI01 9.2218
H2018MI02 27.9255
H2018MI03 22.0336
H2018MI04 26.9973
H2018MI05 -33.3936
H2018MI06 19.2936
H2018MI07 5.8027
H2018MI08 14.2200
H2018MI09 -19.7045
H2018MI10 36.1509
H2018MI11 18.9645
H2018MI12 -34.2436
H2018MI13 -100.0000
H2018MI14 -57.4982
H2018MN01 4.9291
H2018MN02 -0.5445
H2018MN03 12.0864
H2018MN04 -34.1536
H2018MN05 -47.1545
H2018MN06 38.1064
H2018MN07 6.9627
H2018MN08 1.3318
H2018MO01 -56.3473
H2018MO02 23.4182
H2018MO03 40.2373
H2018MO04 35.0245
H2018MO05 -15.7027
H2018MO06 37.4345
H2018MO07 46.6318
H2018MO08 54.6064
H2018MS01 32.4718
H2018MS02 -100.0000
H2018MS03 18.3191
H2018MS04 45.3327
H2018MT00 19.8564
H2018NC01 -41.4209
H2018NC02 24.4373
H2018NC03 100.0000
H2018NC04 -31.7527
H2018NC05 23.2818
H2018NC06 23.1155
H2018NC07 26.4800
H2018NC08 16.0736
H2018NC09 19.6391
H2018NC10 29.4127
H2018NC11 37.1864
H2018NC12 -38.3745
H2018NC13 3.5382
H2018ND00 42.4055
H2018NE01 31.7518
H2018NE02 4.3173
H2018NE03 52.1127
H2018NH01 2.1800
H2018NH02 -4.2091
H2018NJ01 -27.7091
H2018NJ02 10.0391
H2018NJ03 11.8291
H2018NJ04 28.4891
H2018NJ05 2.8409
H2018NJ06 -27.8927
H2018NJ07 11.5073
H2018NJ08 -60.2800
H2018NJ09 -42.1182
H2018NJ10 -75.5918
H2018NJ11 17.2736
H2018NJ12 -36.1645
H2018NM01 -18.3164
H2018NM02 10.4127
H2018NM03 -18.4518
H2018NV01 -42.4518
H2018NV02 15.2191
H2018NV03 -0.4364
H2018NV04 -10.8673
H2018NY01 13.4427
H2018NY02 10.4991
H2018NY03 -13.4591
H2018NY04 -19.2436
H2018NY05 -100.0000
H2018NY06 -100.0000
H2018NY07 -100.0000
H2018NY08 -100.0000
H2018NY09 -71.5936
H2018NY10 -57.3655
H2018NY11 17.9755
H2018NY12 -62.9582
H2018NY13 -82.6691
H2018NY14 -67.9118
H2018NY15 -83.1982
H2018NY16 -100.0000
H2018NY17 -100.0000
H2018NY18 -7.4718
H2018NY19 -1.9518
H2018NY20 -31.9564
H2018NY21 19.5945
H2018NY22 5.6236
H2018NY23 15.4000
H2018NY24 5.6118
H2018NY25 -20.2509
H2018NY26 -49.9591
H2018NY27 29.3445
H2018OH01 10.8836
H2018OH02 28.2809
H2018OH03 -39.5645
H2018OH04 27.9964
H2018OH05 36.0618
H2018OH06 48.4873
H2018OH07 30.6973
H2018OH08 33.6236
H2018OH09 -40.9436
H2018OH10 27.3609
H2018OH11 -58.2418
H2018OH12 30.7936
H2018OH13 -34.5827
H2018OH14 24.5000
H2018OH15 26.4673
H2018OH16 21.1464
H2018OK01 40.0345
H2018OK02 54.5745
H2018OK03 56.9700
H2018OK04 40.9564
H2018OK05 13.4773
H2018OR01 -28.6100
H2018OR02 37.5627
H2018OR03 -100.0000
H2018OR04 -18.1473
H2018OR05 -7.8564
H2018PA01 -7.7027
H2018PA02 -52.7173
H2018PA03 -10.4955
H2018PA04 -5.1582
H2018PA05 9.6345
H2018PA06 2.6782
H2018PA07 15.7891
H2018PA08 7.1600
H2018PA09 32.0145
H2018PA10 31.3682
H2018PA11 36.7445
H2018PA12 32.4018
H2018PA13 -5.4700
H2018PA14 -5.3309
H2018PA15 37.9745
H2018PA16 18.0064
H2018PA17 2.8273
H2018PA18 -90.4273
H2018RI01 -29.1282
H2018RI02 -30.7545
H2018SC01 16.1400
H2018SC02 25.0182
H2018SC03 48.2245
H2018SC04 26.5382
H2018SC05 19.8327
H2018SC06 -47.3318
H2018SC07 24.2818
H2018SD00 30.0882
H2018TN01 62.8855
H2018TN02 42.4755
H2018TN03 37.1945
H2018TN04 36.1682
H2018TN05 -27.8145
H2018TN06 49.6691
H2018TN07 41.9536
H2018TN08 40.9718
H2018TN09 -44.1327
H2018TX01 56.3627
H2018TX02 28.6182
H2018TX03 31.8482
H2018TX04 58.9718
H2018TX05 37.4882
H2018TX06 14.0882
H2018TX07 20.1127
H2018TX08 55.6082
H2018TX09 -100.0000
H2018TX10 23.3545
H2018TX11 62.2218
H2018TX12 34.9273
H2018TX13 63.3109
H2018TX14 22.8400
H2018TX15 -21.9955
H2018TX16 -47.0327
H2018TX17 24.1491
H2018TX18 -48.4364
H2018TX19 59.4518
H2018TX20 -100.0000
H2018TX21 26.5318
H2018TX22 23.2409
H2018TX23 1.5645
H2018TX24 22.9609
H2018TX25 22.0382
H2018TX26 34.1909
H2018TX27 6.0418
H2018TX28 -100.0000
H2018TX29 -43.2345
H2018TX30 -100.0000
H2018TX31 21.7736
H2018TX32 16.9127
H2018TX33 -48.5264
H2018TX34 -24.3336
H2018TX35 -35.2645
H2018TX36 59.6955
H2018UT01 40.4927
H2018UT02 26.9800
H2018UT03 45.2400
H2018UT04 20.7636
H2018VA01 16.3691
H2018VA02 12.1964
H2018VA03 -100.0000
H2018VA04 -25.6436
H2018VA05 9.2955
H2018VA06 30.1718
H2018VA07 11.6209
H2018VA08 -42.6291
H2018VA09 44.0218
H2018VA10 4.5100
H2018VA11 -40.9736
H2018VT00 -36.7964
H2018WA01 -12.3118
H2018WA02 -100.0000
H2018WA03 19.0327
H2018WA04 35.1536
H2018WA05 15.8118
H2018WA06 -18.4873
H2018WA07 -59.7109
H2018WA08 5.2936
H2018WA09 -100.0000
H2018WA10 -22.5145
H2018WI01 29.3327
H2018WI02 -100.0000
H2018WI03 -10.8436
H2018WI04 -45.2700
H2018WI05 34.0427
H2018WI06 17.0255
H2018WI07 25.4609
H2018WI08 25.6518
H2018WV01 44.8818
H2018WV02 35.3682
H2018WV03 54.0700
H2018WY00 42.0009

Appending the original data set

APPENDING WITH PREDICTIONS

Code
#Appending original data set
#Note: .pred was scaled to percent above, so divide by 100 to match
#the decimal-form margins in election_results
results_final<-election_results%>%
  mutate(R.D.Victory.Margin=
        ifelse(is.na(R.D.Victory.Margin),
                implementation_predictions$.pred/100,
                R.D.Victory.Margin))

SAVING CSV FILE

Code
#Writing final results to a csv file
write.csv(results_final,"results_final.csv",row.names=FALSE)